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Abstract 

Background: Identifying the location of transcription factor bindings is crucial to understand transcriptional 
regulation. Currently, Chromatin Immunoprecipitation followed with high-throughput Sequencing (ChlP-seq) is able 
to locate the transcription factor binding sites (TFBSs) accurately in high throughput and it has become the gold- 
standard method for TFBS finding experimentally. However, due to its high cost, it is impractical to apply the 
method in a very large scale. Considering the large number of transcription factors, numerous cell types and 
various conditions, computational methods are still very valuable to accurate TFBS identification. 

Results: In this paper, we proposed a novel integrated TFBS prediction system, CTF, based on Conditional 
Random Fields (CRFs). Integrating information from different sources, CTF was able to capture patterns of TFBSs 
contained in different features (sequence, chromatin and etc) and predicted the TFBS locations with a high 
accuracy. We compared CTF with several existing tools as well as the PWM baseline method on a dataset 
generated by ChlP-seq experiments (TFBSs of 13 transcription factors in mouse genome). Results showed that CTF 
performed significantly better than existing methods tested. 

Conclusions: CTF is a powerful tool to predict TFBSs by integrating high throughput data and different features. It 
can be a useful complement to ChlP-seq and other experimental methods for TFBS identification and thus 
improve our ability to investigate functional elements in post-genomic era. 

Availability: CTF is freely available to academic users at: http://cbb.sjtu.edu.cn/~ccwei/pub/software/CTF/CTF.php 



Introduction 

Functional elements in genomes play important roles in 
many biology processes. For example, enhancers, silencers, 
and transcriptional factor binding sites (TFBSs) are 
required in transcription. Thus, identifying functional ele- 
ments in genomes is one of most important problems in 
post-genomic era [1-3], which is essential to elucidate 
gene regulation comprehensively. TFBS is one important 
type of functional elements. However, it is very challenging 
to locate the actual positions of TFBSs because they are 
generally very short (10 ~ 20 bp) and highly degenerate. 
Besides, only a small fraction of their patterns in a genome 
are actually bounded by transcription factors [4-6] . 
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Recently, the advance of experimental technology gready 
expands our ability to detect the locations of TFBSs. ChlP- 
seq (chromatin immunoprecipitation followed by mas- 
sively parallel sequencing) [7] technology is utilized to find 
out the binding motifs in a high accuracy and a high 
throughput. ChlP-seq is becoming the gold-standard 
method for TFBS identification. However, it has several 
limitations: 1). the quality and source of the antibody 
have a big impact on the result and it is hard to obtain 
high quality antibodies for all TFs; 2). its resolution (about 
300 bps) is too low [8] to locate TFBSs, which are only 
about 20 bps; 3). Another major limitation is that ChlP- 
seq could detect the binding sites of only one transcription 
factor in one experiment and it is expensive. Although 
recent study showed that it was possible to identify bind- 
ing sites of more than one TFs using a single ChlP-seq 
experiment [9], the cost is stiU prohibitively expensive to 
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identify binding sites of many TFs in various cell types and 
conditions. Thus, computational methods are required as 
complementary means for TFBS identifying. 

Efforts have been made to predict TFBSs computation- 
ally by searching patterns of TFBSs in genome. Position 
weight matrix (PWM) [10], which contains TFBS patterns 
in sequence level, is the most widely used model to repre- 
sent and identify TFBSs. However, since the motifs are 
very short and typically degenerated, PWM alone is not 
discriminative enough and will predict a large number of 
false positives. Recently, various approaches have been 
proposed to reduce false positives by integrating informa- 
tion from other sources [11-14]. For example, histone 
markers were shown to correlate with transcription factor 
binding sites and were able to improve the accuracy signif- 
icantly [13,15]. However, the co-occurrence of histone 
markers was not considered in all these methods men- 
tioned above. The co-occurrence of histone markers was 
shown to reflect the state of chromatin and correlated 
with the binding events of transcription factors[16]. 

In this paper, we present CTF (CRF-based TFBS Finding 
system), a novel method to identify TFBSs. Figure 1 
showed the system diagram of CTF. Conditional Random 
Field (CRF) framework [17,18] was employed as the under- 
lying model of CTF. CRF was introduced to bioinformatics 
area recently, such as gene prediction[19,20], and present 
promising results. CRF can capture sophisticated depen- 
dency and integrate information from different sources. 
Therefore it is an ideal framework for TFBS prediction. 

Three types of features, the Position Weight Matrix 
(PWM), the distance to Transcription start sites (TSS 
proximity), and histone markers (8 distinct histone modifi- 
cations), have been integrated into CTF (See Additional file 
1 for more details). Test datasets were collected forl3 tran- 
scription factors in mouse Embryonic Stem cells (ES cells). 
It is shown that by integrating PWM, histone markers and 
TSS proximity, CTF is able to predict TFBSs with high 
accuracy and it outperforms existing methods, including 
Chromia[13] and Cluster-Buster[21] significantiy. 

Results 

Accurately predicting TFBSs by integrating PWIVl, TSS 
proximity and chromatin signature 

CTF was evaluated on 13 TFBS datasets. Several features 
were assessed. First, traditional position weight matrix 
(PWM) model was used. Figure 2 shows the average 
PWM score in bins with or without TFBS inside. Those 
with TFBSs were with higher PWM scores, especially for 
the binding sites of CTCF, Klf4 and Zfx. Still, the PWM 
scores of binding sites of Smadl, Sox2 and Nanog failed 
to distinct themselves from the background. 

CTF also integrated histone markers and transcription 
start site (TSS) data. Histone modifications were observed 
across genome and some of them correlated strongly with 



TFBSs[13]. In addition, by studying the combination of 
different histone modifications, it was shown that chroma- 
tin states were related to activity of genomic regions and 
regulation events[ll]. Therefore, histone markers and 
their combinations were informative for the prediction of 
TFBSs. In our work, 8 distinct histone markers were used: 
H3K27me3, H3K36me3, H3K4mel, H3K4me2, H3K4me3, 
H3K9me3, H3 and H4K20me3. Another feature included 
was the TSS proximity. It was an indicator of whether a 
bin was within 2 kb of a TSS, the promoter region defined 
in this paper. The discriminative power of each histone 
marker could be measured by counting the frequency dif- 
ference of a certain feature in bins with TFBSs and in bins 
without TFBSs. As Won et al presented that H3K4me2 
and H3K4me3 were the most discriminative, while 
H3K4mel was less discriminative [13]. It was consistent 
with our knowledge that H3K4mel/2/3 were active mar- 
kers. In addition, we have observed the enrichment of 
binding sites of some TFs such as c-Myc and Zfx in pro- 
moter regions (Additional file 3) and the enrichment can 
be captured by the TSS-proximity feature. 

To evaluate the contribution of each feature, we tested 
CTF models that combined different features. In consistent 
with previous analysis, CTF with PWM and H3K4mel 
(AUC = 0.84) or H3K4me2 (AUC = 0.86) or H3K4me3 
(AUC = 0.82) showed superior performance than CTF 
with PWM and any other single feature (Figure 3). Also, 
integration of TSS proximity was able to improve the accu- 
racy (AUC = 0.77) compared with model based solely on 
PWM (AUC = 0.75). Though, some other features made 
little contribution and related models showed similar per- 
formance as the baseline method that solely based on 
PWM. In the final combination, all features were included. 
We did not select features because the number of motifs in 
our dataset was very large which made it possible to 
include many features with a low risk of over fitting. Also, 
during the training of CTF, unrelated features would be 
assigned with weights close to zero. Combining all features, 
the final CTF (AUC = 0.91) outperformed all models with 
less features by at least 5%. This result demonstrated that 
CTF was able to integrate different information effectively 
and make better prediction. 

Comparison with other methods 

To further evaluate CTF, we then compared CTF with a 
couple of prevalent existing algorithms. Chromia[13] is an 
integrated method based on Hidden Markov Model 
(HMM) and it predicts TFBSs based on PWM and chro- 
matin signatures. Cluster-Buster[21] is a algorithm to find 
motif clusters (or cis-regulatory modules), which is also 
based on PWM. Cluster-Buster considers not only the sig- 
nal (PWM score) of motifs but also their co-occurrence. 

These tools were tested on the 13 TFBS datasets. The 
AUCio% was calculated as the measurement of 
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Figure 1 System diagram of CTF. This figure showed the system diagram of CTF. Different features (PWM, chromatin signatures, and any other 
features such as the distance to a TSS) were integrated into CFF. Combining these features, CTF called CRF frameworl< to obtain the probability- 
ike scores of TFBSs across the whole genome. The CRF model of CTF contained two states: transcription factor binding site (TFBS) state and 
non-TFBS (bacl<ground) state. 



performance (See Methods for details). Figure 4 present 
the accuracy distribution for different TFBS identification 
methods on the 13 datasets. Results showed that CTF 
achieved significantly better performance (AUG = 0.073) 
than all other methods. Also, the results showed CTF 
and Chromia outperformed PWM method, which 
implied that integration of histone markers was necessary 
and could indeed significantly improve the accuracy. Sur- 
prisingly, Cluster-Buster showed slightly worse AUCio% 
than PWM. However, the results of Cluster-Buster on 
Sox2 and Oct4 were slightly better than PWM. It was 
known that Sox2, Oct4 and Nanog were able to form a 
complex and their motifs were very close to each other. 



From previous results, it turned out that only Chromia 
was comparable to CTF in terms of AUCio% on test data- 
sets. Then, we compared CTF, Chromia and PWM in 
terms of complete AUC as well as the true positive rate 
at 1% false positive rate. PWM was used as the baseline 
method. ROC curves of all three methods on data of 
STAT3 and E2fl were shown in Figure 5 and results on 
all TFs were shown in Additional file 2. Results showed 
that CTF had better accuracy than other two methods. 
Table 1 listed the results of all 13 TFBS datasets. CTF 
performed the best in all datasets. On average, AUC of 
CTF was larger than AUC of Chromia by 3%. Next, we 
also compared the true positive rate (TPR) of all three 
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c-Myc CTCF E2f1 ESrrb Klf4 Nanog n-Myc Oct4 Smad1 Sox2 STATS Tcfcp2l1 Zfx 

Figure 2 PWM score comparison for TFBSs and background sequences. This figure showed the average position weight matrix scores in 
bins that contain transcription factor binding sites (TFBSs) and bins that do not (bacl<ground) on the datasets of 13 transcription factors. 



methods at 1% false positive rate (FPR). The results were 
shown in Additional file 3. On average, the CTF had the 
highest TPR (0.55), which was much better than TPR of 
other two methods (TPRchromia = 0.33 and TPRp^x^M = 
0.23). To sum up, CTF outperformed existing methods in 
different metrics. 

Discussion 

CTF, a novel integrative TFBS prediction system, was pro- 
posed in this paper. Although CTF achieved a high 



accuracy, there are still much room for improvement. For 
example, in current version, only the locations of the 
peaks of histone modifications were considered in CTF. 
Continuous feature functions that score the shape and 
intensity could be included in the future versions. In addi- 
tion, the CRF framework itself is very flexible and new fea- 
tures can be added into CTF in a straightforward manner. 
CTF can also be applied to similar problems such as the 
prediction of enhancers. We expect that CTF can facilitate 
the identification of binding sites of transcription factors 
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Figure 3 Accuracy (AUC) for PWM and CTF with different features. The figure showed the average AUCs (area under curve) of different 
models on the dataset of 1 3 TFBSs. Models included PWM and CTF with different combinations of features. The AUCs of CTF models were 
computed using 10-fold cross-validation, while the AUC of PWM was measured directly. 
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Figure 4 Accuracy (AUC10%) for different TFBS identification methods. This figure showed the average AUC,o"/„ (see Methods for details) of 
different TFBS prediction tools on the dataset of 13 TFBSs. The AUC of CTF models was computed using 10-fold cross-validation. The AUCs of 
other methods were measured directly from their results. 
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Figure 5 ROC for CTF, Chromia and PWM on data of STAT3 and E2f1. This figure showed the ROCs of CTF, Chromia and PWM on the 

dataset of E2f1 (left) and STATS (right). CTF was the CTF model with all features and its ROC was obtained by using 10-fold cross-validation and 
changing the threshold. ROC of Chromia was calculated using the data and model contained in its release. ROC of PWM was got by scoring 
directly. The complete list of results for all 13 TFs were shown in Additional file 2. 
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Table 1 The AUC of CTF, Chromia and PWM on the 
dataset of 13 TFs 
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0.91 
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as well as other functional elements, and improve our 
knowledge about gene regulation. 

Conclusions 

In this paper, we present and evaluated CTF, a novel inte- 
grative method to predict transcription factor binding sties 
(TFBSs) by combining various features using conditional 
random field as the underlying framework. Our results 
showed that CTF successfully integrated position weight 
matrix (PWM), distance to transcription start sites (TSSs) 
and 8 distinct histone markers, which in total improved 
accuracy of TFBS prediction significantly. It outperformed 
models with only part of those features. Most importantly, 
when compared with some existing representative tools, 
CTF showed significant superior performance. CTF is an 
effective novel integrative TFBS prediction system, and 
has a great potential in other functional element finding. 

Methods 

CRF-based TFBS finding system 

CTF system has been created to predict transcription 
factor binding sites (TFBSs) by integrating information 
from different sources. The system diagram of CTF was 
shown in Figure 1. In CTF, a genome is divided into 
200 bp bins first. Then, the conditional probability-like 
score of a label sequence (TFBS and non-TFBS) given 
an observation sequence was computed as follows. 

exp (e E ^kfk {yt, yt-i, t, x)^ 



y \t=i k=i 



where y is the label sequence or annotation of all bins, 
X is the observed genomic sequence is the k * feature 
functions and is the corresponding weight. The fea- 
ture function ^ can be an arbitrary function on x and y' 
is any label sequence. In CTF, the possible values for 
label sequence of one bin is 0 (non-TFBS) and 1 (TFBS). 

Feature design 

In CTF, several types of feature functions have been 
designed to capture patterns contained in features. The 
first type of feature functions are PWM scoring func- 
tions. The second type of feature functions are indicator 
functions. Each of these indictor function checks the 
occurrence of a feature. For example, a feature function 
of this type can be interpreted as an indicator of a bin 
in a promoter region if the j-th feature is TSS proximity, 
or an indicator of a bin within an H3K4me2 peak if that 
i-th feature corresponds to H3K4me2. The third type of 
feature functions targets the co-occurrence of two his- 
tone markers. This type of feature functions are able to 
capture co-occurring features such as a bivalent domain 
[22] or a bin that is "not in a promoter region or 
H3K4me3", which is a marker of active enhancer [23]. In 
addition, we have defined feature functions to capture 
patterns in adjacent bins as a complement for the above 
feature functions. With these feature functions, CTF is 
able to distinguish TFBSs from the background with 
high accuracy. 

Different function templates were created for different 
types of feature functions in CTF. Let x be a feature 
matrix (note, x is not a genomic sequence here. See 
Additional file 1 for more details), then Xt^ j is the ele- 
ment in i-th row and j-th column, i.e. the value of i-th 
feature in the j-th bin in the genome. The first row cor- 
responds to PWM scores. Similarly, y is the label 
sequence (or annotation sequence) and is the label of 
the y-th bin (1 for TFBS and 0 otherwise). I{conditions} 
is denoted as an indicator function and its value is 1 if 
and only if all conditions hold. The first type of feature 
function is for PWM. It is defined as below, 

/(y;'y;-i,j,x) =xij/{y; = u}, 

where m is 0 or 1 which will be the label of that bin. It 
is the only type of real value function in CTF. The sec- 
ond kind of feature functions are designed to capture 
the occurrence of features. It is defined as 

f (yj,Yi-i,},x} = I{Yi-i = uandyj = vandxi^j = 1}, 

where both u and v are labels. The third type of fea- 
ture function targets the co-occurrence of two histone 
markers and its definition is 
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f [Yj,Yj-i,i,x) = I[Yj-i = uandyj = vandxij = landxiij = 1), 

where i and /' corresponds to two histone markers. 
This kind of feature functions are able to capture co- 
occurring features such as bivalent domains [22] or "not 
in a promoter region or H3K4me3", which is a marker 
of active enhancer [23]. At last, feature functions to cap- 
ture patterns in adjacent bins as a complement for 
above feature functions are defined as 

/()'j,)'j_i,j,x) = I[yj-i = uandyj = vandxfj-i = landxij = 1}, 

and 

/(yj.yj-ijj.x) = /{y,-i = uandyj = vandxfj-i = landxij = landxcj^i = 1) 

where i and corresponds to two features and u and 
V are tags. 

Training 

To estimate the parameter vector X, we use a Regular- 
ized Maximum Conditional Log Likelihood method as 

^ML = argmax (ln(p (y|x; A,)) n) 

X 

That is 

A.ML = arg max Xufk - In (Z(x)) - (2) 
where Z{x) = J2^P\J2Y1 ^^f^ ('''i' >^t-i ' ^ ) is the 

y' \t=l fe=l / 

partition function and 1 1 1 1 is the L-2 norm. In CTF, 
liblbfgs (http://www.chokkan.org/software/liblbfgs/), an 
open source library for unconstrained minimization, was 
used to find the optimal weight vector, X,. 

Prediction 

To predict a label for each bin, we estimated the mar- 
ginal probability of y-th bin to be TFBS as 

= P (y, = 1 |x; A.) , 

which is assigned as the score of each bin. Thus, we 
can set a threshold and bins will be assigned as TFBSs if 
their scores exceed the threshold. The rest bins are 
assigned as background. 

Data 

The binding sites of 13 transcription factors (TFs) in the 
mouse ES cells were obtained directly from the ChlP- 
seq data of Chen et al. [24] The 13 TFs were c-Myc, 
CTCF, E2fl, ESrrb, Klf4, Nanog, n-Myc, Oct4, Smadl, 
Sox2, STATS, Tcfcp211 and Zfx. The position weight 
matrices (PWM) of TFs were obtained from JASPAR 



[28] and PWMs not available in JASPAR were obtained 
from Chen et al[24]. The locations of transcription start 
sites (Refseq mm8, April 8, 2012) were obtained from 
UCSC genome browser[25]. Also, the sequence of 
mouse genome (mm8, April 8, 2012) was downloaded 
from UCSC Genome Browser. Original ChlP-seq data 
on 8 distinct histone modification information was 
obtained from [26]. MACS [27] was employed with 
default parameters to call peaks from ChlP-seq data. 

Generating gold-standard TFBS dataset and feature 
matrix 

"Peak-centric" [15] method was used to generate gold- 
standard dataset on the binned genome. First, mouse gen- 
ome was divided into 200bp bins. Then, we assigned bins 
overlapped with the centers of TFBSs as positive ones and 
other bins as negative ones. Similar strategy was applied to 
generate a feature matrix (Additional file 1). The PWM 
score assigned to a certain bin was the maximal PWM 
score inside the bin. Then, for other features, the value 
corresponding to a histone modification of a certain bin 
was set to 1 if that bin overlapped with one peak and 0 
otherwise. As for transcription start site (TSS) proximity, 
we defined the promoter region as a 4,000-bp interval 
centred at the TSS and if bins overlapped with that region, 
their values of TSS proximity were set to 1; otherwise, 
they were 0. 

Performance evaluation 

In order to evaluate the performance of CTF, 10-fold 
cross-validation was employed. In the cross validation, 
19 autosomes and chromosome X in mouse genome 
were randomly divided into 10 groups. Then, one 
group was utilized as test set and the rest as the train- 
ing set. To measure the performance, we calculated 
Area Under the Curve (AUC) of Receiver Operator 
Characteristic (ROC) curve. ROC curve is a curve of 
True Positive Rate (TPR) vs. False Positive Rate (FPR) 
by changing the threshold of the model. For some 
methods, we were unable to get enough prediction to 
plot the complete ROC curve. Thus, in the comparison 
of all methods, we only computed the area under ROC 
curve when FPR was less than 10%, which was denoted 
as AUCio%. Another rationale was that in this range, 
the number of false positives was moderated and the 
model was useful. 

We defined True Positives (TPs) as positive bins that 
were predicted as TFBSs and False Positives (FPs) as 
non-TFBS bins that were predicted as TFBSs. Similarly, 
negative bins predicted as non-TFBSs were termed True 
Negatives (TNs). Negative bins predicted as positives 
were defined as False Negatives (FNs). Then, True Posi- 
tive Rate (TPR) was defined as the fraction of TPs called 
by a model in all positives. False Positive Rate (FPR) was 
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defined as the fraction of FPs called by a model in all 
negatives. 

In order to evaluate other methods with the same cri- 
terion, we put TFBSs predicted by other methods into 
bins according to their positions and the scores of those 
bins became the scores of corresponding TFBSs. If there 
were several TFBSs in one bin, the maximal score was 
chosen as the score of the bin. In this manner, we could 
measure the performance of all methods with the same 
criterion. 

Running other methods 

We compared CTF with two existing methods and the 
baseline PWM method. The two existing methods 
were Chromia[13] and Cluster-Buster[21]. Chromia 
was downloaded from its website (http://tabit.ucsd. 
edu/download/Chromia2.tar.gz). Since the current 
release of Chromia contained the prediction result files 
generated from the same data set used in this paper, 
the results of Chromia was used directly. After this, 
predicted TFBSs were merged to bins and the results 
were then evaluated. Cluster-buster focused on detect- 
ing clustered motifs within a relatively narrow range, 
and did not consider epigenetic modification informa- 
tion. Cluster-Buster was run with parameters, "-c 1 -m 
1 -g 20 -f 2". Position weight matrix (PWM) baseline 
method used solely the PWM score of every bin to 
identify TFBSs and we used various cut-offs to draw 
the ROC curves. 

Additional material 



List of abbreviations 

TFBS (transcription factor binding site); ChlP-seq (chromatin 
immunoprecipitation followed by massively parallel sequencing); CRF 
(conditional random field); CFF (CRF-based TFBS finding system); TP (true 
positive); TN (true negative); FP (false positive); FN (false negative);. FPR (false 



positive rate); TPR (true positive rate); PWM (position v\(eight matrix); ROC 
(receiver operating characteristic); AUC (area under the curve). 
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